Introduction

Crucial to all good statistical analyses is a thorough exploratory analysis. In this document, we will introduce a few techniques for developing an understanding of our dataset, including an understanding of its limitations.

Load packages

Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.

library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)

File set up

If you have not done so already, you need to download the two sets of files associated with the workshop.

All the workshop script and tutorial files can be found on GitHub at github.com/joenomiddlename/DFO_SDM_Workshop_2020. To download these files to your computer, click Code and then Download ZIP. Once downloaded, unzip the DFO_SDM_Workshop_2020-main.zip file. To access some of the precompiled data, we need to make the unzipped folder the working directory. To do so in RStudio, navigate to the correct folder using the Files pane (bottom right panel) and open it. Then click ‘Set As Working Directory’ under the ‘More’ tab.

We should now be inside the folder titled ‘DFO_SDM_Workshop_2020’; you can verify this using getwd().
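As a quick sanity check, the following sketch prints the working directory and its folder name (note the exact folder name may include a -main suffix, depending on how GitHub packaged the ZIP):

```r
# Print the current working directory and extract its folder name;
# the folder name should match the unzipped workshop folder.
getwd()
basename(getwd())
```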

Now we can load in the precompiled data:

list2env(readRDS('./Data/Compiled_Data.rds'), globalenv())
## <environment: R_GlobalEnv>

Fin whale data

The data we use throughout this workshop are sightings of fin whales. We use two main data sources:

  • sightings from systematic boat surveys (conducted by DFO and NOAA)

  • sightings from two whale-watching operators

The sightings span 2007-2011. The sightings from the systematic surveys, but not those of the whale-watching operators, are accompanied by the boat tracklines.

Coordinate reference system

These are spatial data. The first thing we must always do is check for consistency between the coordinate reference systems (CRS) of each spatial object in use! To print the CRS of an sp object, simply append @proj4string to the name of the spatial object and run. For example, to check the CRS of the whale watch (WW) sightings we run:

Sightings_DRWW_sp@proj4string 
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
We check the CRS of the other objects, omitting the code for brevity; each returns the same output:
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs

All of the spatial objects are in lat/lon - good! For future analysis we will be projecting the data into a different coordinate reference system to better preserve Euclidean distance.

Finally, let’s turn off all warnings associated with coordinate reference systems.

rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")

Plotting the data

Our first goal is generally to plot the data on a map. The gg() and gmap() functions from the inlabru package will prove extremely useful for plotting spatial data! Our spatial objects come from the sp package; the classes of objects from the sp package begin with ‘Spatial’, for example SpatialPointsDataFrame.

We have written a bespoke function gg.spatiallines_mod() to easily add SpatialLinesDataFrame objects to the plots too. This will prove useful for plotting transect lines. We load the bespoke functions into the working environment now.

source('utility_functions.R')

Let’s plot our data!

If the data are in lat/lon format then the gmap() function will automatically add a terrain layer from Google Maps to the plots. Let’s plot the survey sightings in blue, the survey tracklines in black, the whale watch sightings in purple, and the whale watch ports in red.

gmap(Sightings_survey) +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')
## Warning in proj4string(x): CRS object has comment, which is lost in output

By default gmap() uses Google Maps; see Additional tips for ways to use open-source maps for publication.

To remove the map layer, simply replace the gmap(Sightings_survey) with ggplot().

ggplot() + # Notice the empty ggplot() call
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')

This plot hides some crucial information about the data collection. For example, the survey sightings and tracklines do not come from a single survey, or even a single organisation! The easiest way to visualise this is to subset the data accordingly.

table(Effort_survey$DATASET)
## 
##    DFO NOAA_1 NOAA_2 
##     49     54     70
# there are 3 surveys
ggplot() +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='DFO',], colour='purple') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_1',], colour='red') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_2',], colour='yellow')

This is problematic! The DFO tracklines (in purple) do not overlap with the two NOAA surveys! Thus, any future model will be unable to identify any differences in protocol efficiency. This is because any model intercepts will be confounded with the latent spatial field, but more on that later!

Exercise 1

In addition, the surveys were conducted across 4 separate years. Let’s plot the survey tracklines by year. Note that the names of the relevant variables in Effort_survey are DATASET and YEAR. Try this on your own! If you get stuck, click ‘Show Answer Code’. Hint: the YEAR variable is of type character and contains 4 unique values (see below).

table(Effort_survey$YEAR)
## 
## 2007 2008 2009 2011 
##  101   20   19   33
class(Effort_survey$YEAR)
## [1] "character"
ggplot() +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2007',], colour='purple') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2008',], colour='red') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2009',], colour='blue') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2011',], colour='yellow')
## Regions defined for each Polygons

Note that the surveys from years 2007, 2008 and 2009 covered largely different regions! Again, this is problematic if we want to model any changes in the whale distribution over time! The effect of year will be confounded with the spatial field. That said, the data from 2011 appear to be a good candidate for model comparison, as their spatial range overlaps with the effort from the other 3 years. We will hold out these data as our test data and use 2007, 2008, and 2009 as our training data.

xtabs(~ YEAR + DATASET, data=Effort_survey@data)
##       DATASET
## YEAR   DFO NOAA_1 NOAA_2
##   2007  49      3     49
##   2008   0     20      0
##   2009   0     19      0
##   2011   0     12     21

2011’s data comes exclusively from NOAA.

Transforming the data into a new CRS

For modelling, we will transform the data from lat/lon into a new “Canadian” CRS: NAD83 datum with UTM Zone 20N projection. This projection will help to preserve Euclidean distance between points. The EPSG code is 2961.

To do the transformation, we will use the spTransform() function. For example, to transform the Whale Watch sightings spatial object Sightings_DRWW_sp, we first define the CRS object as follows:

Can_proj <- CRS("+init=EPSG:2961")
Can_proj <- fm_crs_set_lengthunit(Can_proj, unit='km')

The second line of code specifies that we want to work in units of km instead of the default meters. This can prove vital in applications to avoid numerical overflow.

Next, we transform Sightings_DRWW_sp:

Sightings_DRWW_sp <- spTransform(Sightings_DRWW_sp, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output
Sightings_DRWW_sp@proj4string
## CRS arguments:
##  +init=EPSG:2961 +proj=utm +zone=20 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0
## +units=km +no_defs
Notice the changed output from calling @proj4string. We repeat this for all the spatial objects that are points or lines.
Sightings_Opp_sp <- spTransform(Sightings_Opp_sp, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output
Sightings_survey <- spTransform(Sightings_survey, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output
Effort_survey <- spTransform(Effort_survey, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output
WW_ports <- spTransform(WW_ports, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output

Transforming the ‘raster’-like SpatialPixelsDataFrame objects (Slope, Bathym, Dist_Brier, and Dist_Quoddy) using spTransform would be inappropriate here, because the projection leads to a curvature of the pixels. A more appropriate approach is bilinear interpolation. The projectRaster() function from the raster package works great for this. It requires converting the SpatialPixelsDataFrame object into an object of type raster, which is made easy with the function raster(). Finally, to convert the raster object back into a SpatialPixelsDataFrame, we can use the as() function (the coercion methods are provided by the sp and raster packages). This function is extremely useful for converting spatial objects between the popular packages: sp, spatstat, and sf. We use it substantially throughout the workshop.

Slope <- raster(Slope)
Slope <- projectRaster(Slope, crs=Can_proj)
Slope <- as(Slope, 'SpatialPixelsDataFrame') # Note the specification of class
# repeat for Bathym, combining into one single function call
Bathym <- as(projectRaster(raster(Bathym), crs=Can_proj), 'SpatialPixelsDataFrame')
Domain <- spTransform(Domain, Can_proj)
## Warning in proj4string(x): CRS object has comment, which is lost in output
Dist_Brier <- as(projectRaster(raster(Dist_Brier), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Quoddy <- as(projectRaster(raster(Dist_Quoddy), crs=Can_proj), 'SpatialPixelsDataFrame')

Let’s plot the (transformed) Bathymetry, Slope, and Distance from Port spatial objects. We are going to combine these into a single plot using the multiplot() function from the inlabru package. This function takes ggplot objects as input, together with an argument layout specifying how the plots should be arranged (see Additional tips for ways to change the layout).

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') + 
    coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') + 
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)')+ 
  coord_fixed(ratio = 1),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))

Don’t like the colour scheme? See Additional tips to learn how to define your own manually.

Exercises

If you got stuck on any of the exercises, then please feel free to try them again. Here are links to the problems:

  1. Plotting data by year

Bonus Exercises

  1. Transform one of the spatial objects back into longitude latitude.
  2. Transform one of the spatial objects into units of meters.
  3. (More challenging) Investigate the relationship between distance from port and the two sources of whale watch data. What do you notice? What functional form could explain this type of behaviour? (Hint: use the over() function to extract the values of a spatial covariate (of type SpatialPixelsDataFrame) at point locations (stored as a SpatialPointsDataFrame object). Remember to subset the data accordingly!)
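As a starting point for Bonus Exercise 3, here is a rough sketch. It assumes the projected objects from earlier (Sightings_DRWW_sp and Dist_Brier) are in the workspace; the column name of the extracted covariate depends on the original layer.

```r
# Extract the distance-from-port covariate at the WW sighting locations.
# over() returns a data.frame with one row per point in its first argument.
dist_at_ww <- sp::over(Sightings_DRWW_sp, Dist_Brier)

# Look at the empirical distribution of distances at the sighting locations
hist(dist_at_ww[, 1],
     xlab = 'Distance from port (km)',
     main = 'WW sightings vs distance from port')
```

If most sightings cluster at short distances, that is a feature of the whale-watch effort, not necessarily of the whales.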

Additional tips and code

Using OpenStreetMaps instead of Google Maps

For publication, there can be copyright issues with Google Maps imagery; using OpenStreetMap can help. To do this, simply add the arguments source='osm', force=TRUE to gmap(). Double-check the console output to confirm the maps are indeed being downloaded from Stamen or OSM. For brevity we have suppressed the messages.

gmap(Sightings_survey, source='osm', force=TRUE) +
  gg(Domain) +
  gg.spatiallines_mod(Effort_survey) +
  gg(Sightings_survey, colour='blue') +
  gg(Sightings_DRWW_sp, colour='purple') +
  gg(WW_ports, colour='red')

Note that for this to work, the object must be in the original projection (lat/lon), not UTM.
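If your objects have already been transformed to UTM, one option (a sketch, not run here) is to transform a copy back to lat/lon before calling gmap():

```r
# Transform a copy of the survey sightings back to lat/lon (WGS84)
Sightings_survey_ll <- spTransform(Sightings_survey,
                                   CRS("+proj=longlat +datum=WGS84"))

# The lat/lon copy can now be used to fetch the background map
gmap(Sightings_survey_ll, source = 'osm', force = TRUE) +
  gg(Sightings_survey_ll, colour = 'blue')
```

Working on a copy keeps the UTM version intact for modelling.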

The multiplot() function is a very flexible function that enables publication-quality figures to be made with relative ease.

Changing the layout of multiplot()

To change the order of the plots you can change the argument byrow=TRUE to byrow=FALSE.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope'),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(1:4, nrow=2, ncol=2, byrow = FALSE))

We can also change the size of the figures by changing the matrix in the layout.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope'),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(c(1,1,2,3,4,4), nrow=3, ncol=2, byrow = TRUE))

As you can see, plots of different sizes stretch the maps; to keep the set projection we can use coord_fixed(ratio = 1).

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
    coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') +
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') + 
  coord_fixed(ratio = 1),
layout=matrix(c(1,1,2,3,4,4), nrow=3, ncol=2, byrow = TRUE))

Define your own colour palette

We can define the colour palette easily.

colsc <- function(...) {
  # Note: the PuBuGn palette supports at most 9 colours; requesting 11
  # triggers a warning (shown below) and brewer.pal falls back to 9.
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"PuBuGn")),
                       limits = range(...))
}

Look at ?RColorBrewer::brewer.pal to see what other colour palettes are available.

multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
  colsc(Bathym@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') +
  colsc(Slope@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
  colsc(Dist_Brier@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
  colsc(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors

Have a go at creating your own colour palette function. Investigate the effects of changing both arguments to brewer.pal.

colsc2 <- function(...){
  scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(7,"Spectral")),
                       limits = range(...))
}
multiplot(ggplot() +
  gg(Domain) +
  gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
  colsc2(Bathym@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') +
  colsc2(Slope@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
  colsc2(Dist_Brier@data[,1]),
ggplot() +
  gg(Domain) +
  gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
  colsc2(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))

Potential repeated whale-watch and survey sightings in 2011

Could we not also put 2011’s whale-watch data into the training set? Below is some code showing that we cannot. On every day in 2011 that the survey detected a whale, at least one whale watch company also sighted a whale. The plot below shows that these sightings could have been of the same animal. Even if they were not, the 2011 tracklines still visited areas where the whale watch vessels were present. Since our training and test datasets are required to be independent of each other, we must therefore also remove the 2011 whale watch sightings from the training data.

# what dates were survey sightings made on in 2011?
unique(Sightings_survey$DATE_LO[Sightings_survey$YEAR==2011])
## [1] "2011-06-21" "2011-08-12" "2011-08-13"
# Are sightings by the WW vessels made on these dates?
sum(grepl(Sightings_survey$DATE_LO[Sightings_survey$YEAR==2011][1],
     x=c(Sightings_Quoddy_nodup$WS_DATE[Sightings_Quoddy_nodup$YEAR==2011],
         Sightings_Brier_nodup$WS_DATE[Sightings_Brier_nodup$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey$DATE_LO[Sightings_survey$YEAR==2011][2],
     x=c(Sightings_Quoddy_nodup$WS_DATE[Sightings_Quoddy_nodup$YEAR==2011],
         Sightings_Brier_nodup$WS_DATE[Sightings_Brier_nodup$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey$DATE_LO[Sightings_survey$YEAR==2011][3],
     x=c(Sightings_Quoddy_nodup$WS_DATE[Sightings_Quoddy_nodup$YEAR==2011],
         Sightings_Brier_nodup$WS_DATE[Sightings_Brier_nodup$YEAR==2011])))>0
## [1] TRUE
# Could they be of the same animal? 
ggplot() + gg(Domain) + 
  gg(Sightings_survey[Sightings_survey$YEAR==2011,],colour='green') +
  gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR==2011,],colour='yellow')

Acknowledgements

The code for hiding the code chunks came from Martin Schmelzer, found here.


  1. Alternatively, we can use the function over() from the sp package. We choose to create im objects so that we can later use the ppm function from spatstat for exploratory analysis purposes. ppm fits Poisson process models, equivalent to Maxent models.↩︎